Most research studying social determinants of health (SDoH) has focused on physician notes or structured elements of the electronic medical record (EMR). We hypothesize that clinical notes from social workers, whose role is to ameliorate social and economic factors, might provide a richer source of data on SDoH. We sought to perform topic modeling to identify robust topics of discussion within a large cohort of social work notes. We retrieved a diverse, deidentified corpus of 0.95 million clinical social work notes from 181,644 patients at the University of California, San Francisco. We used word frequency analysis and Latent Dirichlet Allocation (LDA) topic modeling analysis to characterize this corpus and identify potential topics of discussion. Word frequency analysis identified both medical and non-medical terms associated with specific ICD10 chapters. The LDA topic modeling analysis extracted 11 topics related to social determinants of health risk factors including financial status, abuse history, social support, risk of death, and mental health. In addition, the topic modeling approach captured the variation between different types of social work notes and across patients with different types of diseases or conditions. We demonstrated that social work notes contain rich, unique, and otherwise unobtainable information on an individual's SDoH.
Understanding geometric properties of natural language processing models' latent spaces allows the manipulation of these properties for improved performance on downstream tasks. One such property is the amount of data spread in a model's latent space, or how fully the available latent space is being used. In this work, we define data spread and demonstrate that the commonly used measures of data spread, Average Cosine Similarity and a partition function min/max ratio I(V), do not provide reliable metrics to compare the use of latent space across models. We propose and examine eight alternative measures of data spread, all but one of which improve over these current metrics when applied to seven synthetic data distributions. Of our proposed measures, we recommend one principal component-based measure and one entropy-based measure that provide reliable, relative measures of spread and can be used to compare models of different sizes and dimensionalities.
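As a rough illustration of the baseline the abstract calls unreliable, the sketch below computes Average Cosine Similarity over a small set of vectors. The function names and the toy 2-D point sets are assumptions made for this example and are not taken from the paper.

```python
import itertools
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_cosine_similarity(vectors):
    """Mean cosine similarity over all unordered pairs of vectors."""
    pairs = list(itertools.combinations(vectors, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Hypothetical toy data: four points spread evenly around the origin,
# three points packed into a narrow cone, and the spread points moved
# away from the origin by a shared offset.
spread = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0)]
narrow = [(1.0, 0.0), (1.0, 0.1), (1.0, -0.1)]
shifted = [(x + 10.0, y + 10.0) for x, y in spread]

acs_spread = average_cosine_similarity(spread)
acs_narrow = average_cosine_similarity(narrow)
acs_shifted = average_cosine_similarity(shifted)
```

In this toy case the shifted points score almost the same as the narrow cone, although their spread around their own mean is identical to that of the unshifted points. This sensitivity to a shared offset is one plausible reason the measure may be hard to compare across models.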
We propose AstroSLAM, a standalone vision-based solution for autonomous online navigation around an unknown target small celestial body. AstroSLAM is predicated on the formulation of the SLAM problem as an incrementally growing factor graph, facilitated by the use of the GTSAM library and the iSAM2 engine. By combining sensor fusion with orbital motion priors, we achieve improved performance over a baseline SLAM solution. We incorporate orbital motion constraints into the factor graph by devising a novel relative dynamics factor, which links the relative pose of the spacecraft to the problem of predicting trajectories stemming from the motion of the spacecraft in the vicinity of the small body. We demonstrate the excellent performance of AstroSLAM using both real legacy mission imagery and trajectory data courtesy of NASA's Planetary Data System, as well as real in-lab imagery data generated on a 3 degree-of-freedom spacecraft simulator test-bed.
In this paper, we extend previous self-supervised approaches for language identification by experimenting with a Conformer-based architecture in a multilingual pre-training paradigm. We find that pre-trained speech models optimally encode language-discriminatory information in lower layers. Further, we demonstrate that the embeddings obtained from these layers are robust enough to classify unseen languages and different acoustic environments without additional training. After fine-tuning a pre-trained Conformer model on the VoxLingua107 dataset, we achieve results similar to current state-of-the-art systems for language identification. Moreover, our model accomplishes this with 5x fewer parameters. We open-source the model through the NVIDIA NeMo toolkit.
A reconstruction attack on a private dataset $D$ takes as input some publicly accessible information about the dataset and produces a list of candidate elements of $D$. We introduce a new class of data reconstruction attacks based on randomized methods for non-convex optimization. We empirically demonstrate that our attacks can not only reconstruct full rows of $D$ from aggregate query statistics $Q(D)\in \mathbb{R}^m$, but can do so in a way that reliably ranks reconstructed rows by their odds of appearing in the private data, providing a signature that could be used for prioritizing reconstructed rows for further actions such as identity theft or hate crime. We also design a sequence of baselines for evaluating reconstruction attacks. Our attacks significantly outperform those that are based only on access to a public distribution or population from which the private dataset $D$ was sampled, demonstrating that they are exploiting information in the aggregate statistics $Q(D)$, and not simply the overall structure of the distribution. In other words, the queries $Q(D)$ are permitting reconstruction of elements of this dataset, not the distribution from which $D$ was drawn. These findings are established both on 2010 U.S. decennial Census data and queries and on Census-derived American Community Survey datasets. Taken together, our methods and experiments illustrate the risks in releasing numerically precise aggregate statistics of a large dataset, and provide further motivation for the careful application of provably private techniques such as differential privacy.
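The general idea of reconstructing rows from aggregate statistics and ranking them can be sketched on toy data. The code below is not the authors' attack: it swaps in a plain randomized bit-flip local search, assumes $Q(D)$ consists of all two-way marginal counts over binary columns, and ranks candidate rows by how many independent runs produce them. All function names and the toy dataset are hypothetical.

```python
import itertools
import random
from collections import Counter


def marginals(rows, d):
    """Assumed Q(D): counts of every (column pair, value pair) combination."""
    q = Counter()
    for r in rows:
        for i, j in itertools.combinations(range(d), 2):
            q[(i, j, r[i], r[j])] += 1
    return q


def error(q_a, q_b):
    """L1 distance between two marginal tables (missing keys count as 0)."""
    return sum(abs(q_a[k] - q_b[k]) for k in set(q_a) | set(q_b))


def local_search(target_q, n, d, rng, steps=2000):
    """Randomized bit-flip search for n rows whose marginals fit target_q."""
    rows = [[rng.randint(0, 1) for _ in range(d)] for _ in range(n)]
    best = error(marginals(rows, d), target_q)
    for _ in range(steps):
        i, j = rng.randrange(n), rng.randrange(d)
        rows[i][j] ^= 1
        e = error(marginals(rows, d), target_q)
        if e <= best:
            best = e
        else:
            rows[i][j] ^= 1  # undo a flip that made the fit worse
    return [tuple(r) for r in rows], best


def rank_candidates(target_q, n, d, runs=5, seed=0):
    """Rank candidate rows by the number of independent runs producing them."""
    votes = Counter()
    for k in range(runs):
        rows, _ = local_search(target_q, n, d, random.Random(seed + k))
        votes.update(set(rows))
    return votes.most_common()


# Hypothetical private dataset: 8 rows, 5 binary attributes.
D_COLS = 5
private = [
    (1, 0, 1, 1, 0), (0, 0, 1, 0, 1), (1, 1, 0, 0, 0), (0, 1, 1, 1, 1),
    (1, 0, 0, 1, 1), (0, 0, 0, 0, 0), (1, 1, 1, 0, 1), (0, 1, 0, 1, 0),
]
released = marginals(private, D_COLS)  # only this would be published
ranked = rank_candidates(released, len(private), D_COLS)
```

Rows that recur across many restarts are the ones the released statistics constrain most tightly, which is the intuition behind using agreement across runs as a confidence signal. Whether any given row is recovered on this toy data depends on the random seeds.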
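Clinical language processing has received a lot of attention in recent years, resulting in new models or methods for disease phenotyping, mortality prediction, and other tasks. Unfortunately, many of these approaches are tested under different experimental settings (e.g., data sources, training and testing splits, metrics, evaluation criteria, etc.), making it difficult to compare approaches and determine the state of the art. To address these issues and facilitate reproducibility and comparison, we present the Clinical Language Understanding Evaluation (CLUE) benchmark with a set of four clinical language understanding tasks, standard training, development, validation and testing sets derived from MIMIC data, as well as a software toolkit. It is our hope that these data will enable direct comparison between approaches, improve reproducibility, and reduce the barrier to entry for developing novel models or methods for these clinical language understanding tasks.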
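State-of-the-art machine-learning-based models are a popular choice for modeling and forecasting energy behavior in buildings because, given enough data, they are good at finding spatiotemporal patterns and structures even in scenarios where the complexity prohibits analytical descriptions. However, machine-learning-based models for building energy forecasting have difficulty generalizing to out-of-sample scenarios that are not represented in the data, because their architecture typically does not hold physical correspondence to the mechanistic structures linked with the governing phenomena of energy transfer. Thus, their ability to forecast for unseen initial conditions and boundary conditions depends wholly on representativeness in the data, which is not guaranteed in building measurement data. Consequently, these limitations impede their use in real-world engineering applications such as energy management in digital twins. In response, we present a domain adaptation (DA) framework that aims to leverage the well-known understanding of the phenomena governing energy behavior in buildings to forecast for out-of-sample scenarios beyond building measurement data. More specifically, we represent mechanistic knowledge of energy behavior using low-rank linear time-invariant state-space models and subsequently leverage their governing structure to forecast for a target energy system for which only building measurement data is available. We achieve this by aligning the physics-derived subspace that governs global state-space behavior closer to the target subspace derived from measurement data. In this initial exploration, we focus on linear energy systems. We test the subspace-based DA framework by varying the thermophysical properties of the source and target systems to demonstrate the transferability of mechanistic models from physics to measured data.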
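In response to growing recognition of the social, legal, and ethical impacts of new AI-based technologies, major AI and ML conferences and journals now encourage or require submitted papers to include ethics impact statements and undergo ethics reviews. This move has sparked heated debate about the role of ethics in AI and data science research, at times devolving into counterproductive name-calling and threats of "cancellation." We argue that greater focus on the moral education of data scientists may help bridge the ideological divide separating the data science community. We diagnose this deep ideological conflict as one between atomists and holists. Among other things, atomists espouse the idea that facts are and should be kept separate from values, while holists believe that facts and values are and should be inextricable from one another. With the goals of encouraging cross-disciplinary exchange and reducing disciplinary polarization, we draw on a variety of historical sources, ranging from philosophy and law to social theory and humanistic psychology, to describe each ideology's beliefs and assumptions. Finally, we call on atomists and holists within the data science community to exhibit greater empathy during ethical disagreements, and we propose four targeted strategies to ensure that data science research benefits society.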
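In 2022, Ukraine suffered an invasion whose impact has varied sharply over time and across geography. This paper examines the impact of the ongoing disruption on traffic behavior using analytics as well as zonal-based network models. The methodology is a data-driven approach that uses obtained travel-time conditions within an evolutionary algorithm framework, which infers origin-destination demand values in an automated process based on traffic assignment. Because the implementation is automated, numerous daily models can be approximated for multiple cities. The novelty of this paper relative to the previously published core methodology includes an analysis to ensure that the obtained data are suitable, since some data sources were disabled by the ongoing disruption. In addition, the novelty includes a direct linkage of the analysis to the timeline of disruptions, examining their interaction in a new way. Finally, specific network metrics are identified that are particularly suited to conceptualizing the impact of conflict disruptions on traffic network conditions. The ultimate aim is to establish processes, concepts, and analysis that advance the broader activity of rapidly quantifying the traffic impacts of conflict scenarios.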
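Missions to small celestial bodies rely heavily on optical feature tracking for characterization and relative navigation. While deep learning has led to great advances in feature detection and description, training and validating data-driven models for space applications is challenging because of the limited availability of large-scale, annotated datasets. This paper introduces AstroVision, a large-scale dataset comprising 115,970 densely annotated, real images of 16 different small bodies captured during past and ongoing missions. We leverage AstroVision to develop a set of standardized benchmarks and conduct an exhaustive evaluation of both handcrafted and data-driven feature detection and description methods. Next, we employ AstroVision for end-to-end training of a state-of-the-art, deep feature detection and description network and demonstrate improved performance on multiple benchmarks. The full benchmarking pipeline and the dataset will be made publicly available to facilitate the advancement of computer vision algorithms for space applications.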